
Add Fortio-Envoy optimization guide#29

Open
vaibhavk2 wants to merge 3 commits into intel:main from vaibhavk2:envoy

Conversation

@vaibhavk2

Add Fortio-Envoy optimization guide and related documentation.

vaibhavk2 added 3 commits May 1, 2026 09:12
Signed-off-by: Vaibhav Shankar <vaibhav.shankar@intel.com>
Signed-off-by: Vaibhav Shankar <vaibhav.shankar@intel.com>
Signed-off-by: Vaibhav Shankar <vaibhav.shankar@intel.com>
Comment thread software/envoy/README.md

## Overview

This setup evaluates Envoy running as a TCP proxy in front of Fortio, which acts as the backend load generator. The benchmark focuses on proxy-path performance and behavior under load, measuring metrics such as QPS and latency. Both server-side and client-side components are used to generate traffic and collect results, with Envoy and Fortio running in Docker containers based on the images listed below:
Collaborator


As a first-time reader, the topology is unclear. Is it possible to provide a diagram explaining the setup?
It is not clear what runs as the server vs. the client.
Is Fortio being used as both the load generator and the server?
Is the client on a different host?
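A rough sketch of how the topology reads to me, with Fortio serving as both the backend and the load generator; the container names, ports layout, and flags below are illustrative placeholders rather than the actual benchmark script:

```bash
# Backend (server host): Fortio in server mode on port 8080, with the CPU quota from the README
docker run -d --network host --cpus 16 --name fortio-server \
  fortio/fortio server -http-port 8080

# Proxy (server host): Envoy TCP proxy listening on 9090, forwarding to Fortio on 8080
docker run -d --network host --cpus 8 --name envoy-proxy \
  -v "$(pwd)/envoy.yaml:/etc/envoy/envoy.yaml" envoyproxy/envoy

# Client (load-generator host): Fortio driving traffic at the proxy port
docker run --rm fortio/fortio load -qps 0 -c 64 -t 60s http://<server-host>:9090/
```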

Comment thread software/envoy/README.md

## Overview

This setup evaluates Envoy running as a TCP proxy in front of Fortio, which acts as the backend load generator. The benchmark focuses on proxy-path performance and behavior under load, measuring metrics such as QPS and latency. Both server-side and client-side components are used to generate traffic and collect results, with Envoy and Fortio running in Docker containers based on the images listed below:
Collaborator


Consider starting with an overview, followed by paragraphs explaining the two components, and then the topology/setup used.

Overview

This tuning guide describes best known practices to optimize performance... when you run Fortio and Envoy...

Fortio

Envoy

Topology/setup

Comment thread software/envoy/README.md

## CPU Utilization and CPU Quota

The script applies Docker CPU quotas (`--cpus 16` for Fortio, `--cpus 8` for Envoy). On a high core-count server (e.g., 128 cores / 256 threads), Docker enforces these quotas via cgroup CPU bandwidth control. The OS spreads threads across all cores but throttles aggregate CPU time, resulting in roughly **6-7% per-core utilization** across all server cores - not saturation. The CPU quota is the binding constraint, not the workload.
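If it helps to see the throttling mechanism directly, the quota can be inspected on the host; a minimal sketch assuming cgroup v2 and the hypothetical container name `envoy-proxy`:

```bash
# Docker stores --cpus 8 as NanoCpus = 8000000000, which becomes a cgroup CPU
# bandwidth limit of cpu.max = "800000 100000" (800 ms of CPU time per 100 ms period)
docker inspect --format '{{.HostConfig.NanoCpus}}' envoy-proxy

# Throttling counters (nr_throttled, throttled_usec); the cgroup path depends on the
# host's cgroup driver, this is the common systemd layout
cat "/sys/fs/cgroup/system.slice/docker-$(docker inspect --format '{{.Id}}' envoy-proxy).scope/cpu.stat"
```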

Comment thread software/envoy/README.md
- **Event-driven, non-blocking I/O**: Each worker thread runs an independent libevent loop.
- **`--concurrency N`**: Spawns N worker threads. Each thread owns its own listener socket and connection pool, so there is near-zero cross-thread coordination for established connections.
- **TCP proxy mode** (used here): Envoy accepts a TCP connection on port 9090, opens a connection to Fortio on 8080, and shuttles bytes between them. No L7 parsing overhead.
- **In mesh mode (`SECURE_MESH=true`)**: Adds mTLS - Envoy terminates the downstream TLS connection and re-originates a new TLS connection upstream, roughly doubling the cryptographic work per connection.
Collaborator


Earlier in the document you described SECURE_MESH=true as “no Envoy sidecars, raw application performance” (direct mode), but here SECURE_MESH=true means Envoy terminates the TLS connection?
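On the threading bullets above, a hedged example of what the `--concurrency` knob looks like at the command line (invocation and names are illustrative, not the benchmark script's exact flags):

```bash
# 8 Envoy worker threads; per the bullets above, each worker runs its own event
# loop and owns its own listener socket, so established connections stay on one worker
docker run -d --name envoy-proxy --cpus 8 \
  -v "$(pwd)/envoy.yaml:/etc/envoy/envoy.yaml" \
  envoyproxy/envoy envoy -c /etc/envoy/envoy.yaml --concurrency 8
```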

Comment thread software/envoy/README.md

2. **Cache coherency traffic**: Spin locks and atomic CAS operations on shared scheduler state cause cache line bouncing across all sockets. On a multi-socket NUMA system, cross-socket coherency traffic adds latency to every lock acquisition and scales with core count.

3. **Kernel paths involved** (from perf flame graphs):
Collaborator


Is it possible to upload a flamegraph for reference?
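For reference, flame graphs like these are typically captured with perf plus Brendan Gregg's FlameGraph scripts; a sketch, assuming the scripts are cloned locally and with the process lookup as a placeholder:

```bash
# Sample kernel and user stacks of the Envoy process at 99 Hz for 30 seconds
perf record -F 99 -g -p "$(pgrep -o envoy)" -- sleep 30

# Fold the stacks and render an SVG flame graph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > envoy-flame.svg
```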

Comment thread software/envoy/README.md

### 1. NUMA Pinning (Most Impactful)

Pin both Fortio and Envoy to a single NUMA node. This is the single most impactful optimization - it substantially reduces `native_queued_spin_lock_slowpath` overhead by keeping all memory allocations, thread migrations, and NIC interrupts on the same socket.
Collaborator


"Pin both Fortio and Envoy to a single NUMA node": do we mean on the server host?
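Assuming it is the server host, a sketch of what the pinning could look like with Docker's cpuset flags; core ranges and names below are placeholders, and the real layout should be checked with `lscpu` or `numactl -H`:

```bash
# Keep CPUs and memory allocations of both containers on NUMA node 0
docker run -d --network host --cpus 16 --name fortio-server \
  --cpuset-cpus 0-15 --cpuset-mems 0 \
  fortio/fortio server -http-port 8080

docker run -d --network host --cpus 8 --name envoy-proxy \
  --cpuset-cpus 16-23 --cpuset-mems 0 \
  -v "$(pwd)/envoy.yaml:/etc/envoy/envoy.yaml" envoyproxy/envoy

# NIC interrupt affinity is tuned separately, e.g. via /proc/irq/<irq>/smp_affinity_list
```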

